Han, Bo, Paul Cook and Timothy Baldwin (2013) unimelb: Spanish Text Normalisation, In Proceedings of the Tweet Normalization Workshop at SEPLN 2013 (Tweet-norm), Madrid, Spain, pp. 67-71
Abstract
This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). These lexicons enable a very fast, look-up based approach to text normalisation. Experimental results indicate that the corpus-derived lexicon complements existing lexicons, but that the approach could be improved through better handling of certain word types, such as named entities.
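As a rough illustration of the look-up based idea described in the abstract, the sketch below merges several normalisation lexicons and replaces each token with its lexicon entry when one exists. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `merge_lexicons` and `normalise`, the example entries, and the rule that earlier lexicons take priority on conflicts are all hypothetical.

```python
def merge_lexicons(*lexicons):
    """Merge variant -> canonical dictionaries.

    Assumption (not from the paper): lexicons passed earlier take
    priority when the same variant appears in more than one lexicon.
    """
    merged = {}
    for lexicon in lexicons:
        for variant, canonical in lexicon.items():
            # setdefault keeps the first mapping seen for a variant
            merged.setdefault(variant.lower(), canonical)
    return merged


def normalise(tokens, lexicon):
    """Replace each token by its lexicon entry; leave unknown tokens as-is."""
    return [lexicon.get(token.lower(), token) for token in tokens]


# Illustrative entries only; they are not taken from the paper's lexicons.
slang_lexicon = {"q": "que", "xq": "porque"}
corpus_lexicon = {"tb": "también", "q": "qué"}

lexicon = merge_lexicons(slang_lexicon, corpus_lexicon)
print(normalise("no sé xq tb voy".split(), lexicon))
# ['no', 'sé', 'porque', 'también', 'voy']
```

Because normalisation reduces to one dictionary look-up per token, the cost at run time is independent of how the lexicons were built, which is consistent with the abstract's description of the approach as very fast.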
Similar resources
DLSI en Tweet-Norm 2013: Normalización de Tweets en Español
Its lexical richness and easy access to large volumes of information make the Web 2.0 an important resource for Natural Language Processing. Nevertheless, the frequent presence of non-normative linguistic phenomena can make any automatic processing challenging. This paper describes the participation in the Text Normalisation Workshop at the SEPLN conference (Tweet-nor...
Lexical Normalization of Spanish Tweets with Preprocessing Rules, Domain-specific Edit Distances, and Language Models
We present a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system's results at the SEPLN 2013 Tweet-Norm task were above average.
Elhuyar at Tweet-Norm 2013
This paper presents the system developed by Elhuyar for the TweetNorm evaluation campaign, which consists of normalizing Spanish tweets to standard language. The normalization covers only the correction of certain out-of-vocabulary (OOV) words, previously identified by the organizers. The developed system follows a two-step strategy. First, candidates for each OOV word are generated by means of ...
The TALP-UPC Approach to Tweet-Norm 2013
This paper describes the methodology used by the TALP-UPC team for the SEPLN 2013 shared task of tweet normalization (Tweet-Norm). The system uses a set of modules that propose different corrections for each out-of-vocabulary word. The final correction is chosen by weighted voting according to each module's accuracy.
Publication date: 2013